Libraries

Firstly, we are going to import the necessary libraries for our time series analysis. The libraries vary in their utilities, ranging from the most general ones such as tidyverse, up to tseries for dealing with time-series objects.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(TTR)
library(fpp)
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: tseries
library(tseries)
library(TSstudio)
library(padr)

Import Data

Subsequently, we can import our train and test data-set for this analysis, and perform an examination on both data frames by using the glimpse function to understand further their nature. Below are the column names along with their sample data.

#read data into train and test variables respectively
train <- read.csv("data/data-train.csv")
test <- read.csv("data/data-test.csv")
glimpse(train)
## Rows: 137,748
## Columns: 10
## $ transaction_date <chr> "2017-12-01T13:32:46Z", "2017-12-01T13:32:46Z", "201…
## $ receipt_number   <chr> "A0026694", "A0026694", "A0026695", "A0026695", "A00…
## $ item_id          <chr> "I10100139", "I10500037", "I10500044", "I10400009", …
## $ item_group       <chr> "noodle_dish", "drinks", "drinks", "side_dish", "dri…
## $ item_major_group <chr> "food", "beverages", "beverages", "food", "beverages…
## $ quantity         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1…
## $ price_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12…
## $ total_usd        <dbl> 7.33, 4.12, 2.02, 5.60, 3.01, 4.86, 6.34, 7.58, 4.12…
## $ payment_type     <chr> "cash", "cash", "cash", "cash", "cash", "cash", "cas…
## $ sales_type       <chr> "dine_in", "dine_in", "dine_in", "dine_in", "dine_in…
glimpse(test)
## Rows: 91
## Columns: 2
## $ datetime <chr> "2018-02-19T10:00:00Z", "2018-02-19T11:00:00Z", "2018-02-19T…
## $ visitor  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

The dataset includes information about:

  • transaction_date: The timestamp of a transaction
  • receipt_number: The ID of a transaction
  • item_id: The ID of an item in a transaction
  • item_group: The group ID of an item in a transaction
  • item_major_group: The major-group ID of an item in a transaction
  • quantity: The quantity of purchased item
  • price_usd: The price of purchased item
  • total_usd: The total price of purchased item
  • payment_type: The payment method
  • sales_type: The sales method

Data preprocessing

As the transaction data is in a character vector or else known as a string format along with insignificant characters cluttering the space, we need to do a sub-setting first to transform them into a desired format for further processing.

library(stringr)
train$transaction_date <- paste(train$transaction_date %>% substr(1,10), train$transaction_date %>% substr(12, 19), sep = " ")

train$transaction_date %>% head()
## [1] "2017-12-01 13:32:46" "2017-12-01 13:32:46" "2017-12-01 13:33:39"
## [4] "2017-12-01 13:33:39" "2017-12-01 13:33:39" "2017-12-01 13:35:59"

Here, we change the data from character into a POSIXct format.

train <- train %>% 
  mutate(transaction_date = as.POSIXct(transaction_date, tz = "", format("%Y-%m-%d %H:%M:%S")))
range(train$transaction_date)
## [1] "2017-12-01 13:32:46 HKT" "2018-02-18 23:58:32 HKT"
colSums(is.na(train))
## transaction_date   receipt_number          item_id       item_group 
##                0                0                0                0 
## item_major_group         quantity        price_usd        total_usd 
##                0                0                0                0 
##     payment_type       sales_type 
##                0                0

Estimating number of visitors

Estimation from distinct number of receipt id within an hour

Assuming that there are n number of visitors for n number of receipts that are generated within an hour.

#estimate the number of visitors in an hour
train_clean <- train %>% 
  mutate(transaction_date = as.POSIXct(cut(train$transaction_date, breaks = "hour"))) %>% 
  group_by(transaction_date) %>% 
  summarise(visitor = n_distinct(receipt_number))
## `summarise()` ungrouping output (override with `.groups` argument)
train_clean

Estimation from total amount of foods ordered id within a single transaction

Make another data frame which might be more suitable for the model by estimating the number of visitors through analyzing the total amount of foods ordered per transaction (assuming there is one food ordered per visitor).

train_clean2 <- train %>% 
  mutate(transaction_date = as.POSIXct(cut(train$transaction_date, breaks = "hour"))) %>% 
  group_by(transaction_date, item_major_group) %>% 
  mutate(total = 1:n()) %>% 
  filter(item_major_group == "food") %>% 
  group_by(transaction_date) %>% 
  summarise(visitor = max(total))
## `summarise()` ungrouping output (override with `.groups` argument)
train_clean2

Time Range

Here, we check the starting and ending time of the recording period for both estimations.

n visitors for n receipts

range(train_clean$transaction_date)
## [1] "2017-12-01 13:00:00 HKT" "2018-02-18 23:00:00 HKT"

n visitors for n foods ordered within a single transaction

range(train_clean2$transaction_date)
## [1] "2017-12-01 13:00:00 HKT" "2018-02-18 23:00:00 HKT"

Padding

Padding towards both data frames is necessary to assure that our time series data will have a complete hourly interval data.

n visitors for n receipts

train_clean <- train_clean %>% pad(interval = "hour")
train_clean
colSums(is.na(train_clean))
## transaction_date          visitor 
##                0              663

n visitors for n foods ordered within a single transaction

train_clean2 <- train_clean2 %>% pad(interval = "hour")
train_clean2
colSums(is.na(train_clean2))
## transaction_date          visitor 
##                0              665

Filtering the data frames

Assuming that the store opens at 10am and closes at 10pm, it is reasonable to filter only the time within the range of 10 up until the store closes (in this case from 10am until the last order around 10pm). Additionally, we also fill the missing values with 0 following the assumption that there is no records of receipt as there is no visitor at a particular time range.

n visitors for n receipts

# train_clean <- na.omit(train_clean)
train_new <- train_clean %>% 
  filter(format(transaction_date, '%H') >= 10.00 & format(transaction_date, '%H') <= 22.00)
colSums(is.na(train_new))
## transaction_date          visitor 
##                0               33
train_new <- train_new %>% mutate(visitor = na.fill(visitor, 0))
anyNA(train_new)
## [1] FALSE
start <- min(train_new$transaction_date)
start
## [1] "2017-12-01 13:00:00 HKT"
end <- max(train_new$transaction_date)
end
## [1] "2018-02-18 22:00:00 HKT"

n visitors for n foods ordered within a single transaction

train_new2 <- train_clean2 %>% 
  filter(format(transaction_date, '%H') >= 10.00 & format(transaction_date, '%H') <= 22.00)
colSums(is.na(train_new2))
## transaction_date          visitor 
##                0               33
train_new2 <- train_new2 %>% mutate(visitor = na.fill(visitor, 0))
anyNA(train_new2)
## [1] FALSE
start2 <- min(train_new2$transaction_date)
start2
## [1] "2017-12-01 13:00:00 HKT"
end2 <- max(train_new2$transaction_date)
end2
## [1] "2018-02-18 22:00:00 HKT"

Visualization of the data

n visitors for n receipts

train_new %>%
  mutate(transaction_date = format(transaction_date, "%H:%M")) %>% 
  group_by(transaction_date) %>% 
  summarise(avg.visitors = mean(visitor)) %>% 
  ggplot(aes(transaction_date, avg.visitors)) +
  geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per hour")
## `summarise()` ungrouping output (override with `.groups` argument)

  scale_fill_continuous(low = "gray", high = "black")
## <ScaleContinuous>
##  Range:  
##  Limits:    0 --    1
 train_new %>% 
  mutate(transaction_date = format(transaction_date,'%Y-%m-%d')) %>% 
  group_by(transaction_date) %>% 
  mutate(total = sum(visitor)) %>% 
  mutate(transaction_date = wday(transaction_date,label = TRUE, abbr = FALSE)) %>% 
  group_by(transaction_date) %>% 
  summarise(avg.visitors = mean(total)) %>% 
  ggplot(aes(transaction_date, avg.visitors)) +
  geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per day")
## `summarise()` ungrouping output (override with `.groups` argument)

  scale_fill_continuous(low = "gray", high = "black")
## <ScaleContinuous>
##  Range:  
##  Limits:    0 --    1

n visitors for n foods ordered within a single transaction

train_new2 %>%
  mutate(transaction_date = format(transaction_date, "%H:%M")) %>% 
  group_by(transaction_date) %>% 
  summarise(avg.visitors = mean(visitor)) %>% 
  ggplot(aes(transaction_date, avg.visitors)) +
  geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per hour")
## `summarise()` ungrouping output (override with `.groups` argument)

  scale_fill_continuous(low = "gray", high = "black")
## <ScaleContinuous>
##  Range:  
##  Limits:    0 --    1
 train_new2 %>% 
  mutate(transaction_date = format(transaction_date,'%Y-%m-%d')) %>% 
  group_by(transaction_date) %>% 
  mutate(total = sum(visitor)) %>% 
  mutate(transaction_date = wday(transaction_date,label = TRUE, abbr = FALSE)) %>% 
  group_by(transaction_date) %>% 
  summarise(avg.visitors = mean(total)) %>% 
  ggplot(aes(transaction_date, avg.visitors)) +
  geom_col(aes(fill = avg.visitors)) +
  labs(title = "Average visitors per day")
## `summarise()` ungrouping output (override with `.groups` argument)

  scale_fill_continuous(low = "gray", high = "black")
## <ScaleContinuous>
##  Range:  
##  Limits:    0 --    1

From the above data visualizations, despite a significant difference in the number of visitors for both data frames as a result of different approximation, it can be seen that there is a recurring seasonality pattern for daily and weekly observations of both data frames respectively. It can be inferred from the plots above that the average visitors peak at around 8pm for hourly analysis, and Saturday for weekly analysis, which might suggest that there are multiple seasonality patterns in our data frames.

Time Series model

In this section, we are going to make time series models from our processed data frames, analyze their seasonality patterns, forecast them using various methods, and examine the results.

Make time series models

n visitors for n receipts

train_ts <- ts(data = train_new$visitor, start =  start, frequency = 13)
train_ts %>% autoplot()

### n visitors for n foods ordered within a single transaction

train_ts2 <- ts(data = train_new2$visitor, start =  start, frequency = 13)
train_ts2 %>% autoplot()

Decomposition

n visitors for n receipts

train_decomp <- train_ts %>% decompose()
train_decomp %>% autoplot()

n visitors for n foods ordered within a single transaction

train_decomp2 <- train_ts2 %>% decompose()
train_decomp2 %>% autoplot()

#Make time series models with multiple seasonalities

n visitors for n receipts

train_msts<-train_new$visitor %>% msts(seasonal.periods = c(13,13*7))
train_msts_decomp <- train_msts %>% mstl()
train_msts_decomp %>% autoplot()

train_new %>% 
  mutate(Hour = hour(transaction_date), Seasonal = train_decomp$seasonal) %>% 
  distinct(Hour, Seasonal) %>% 
  ggplot(aes(x = Hour, y = Seasonal)) +
  geom_col(aes(fill = Seasonal))+
  scale_fill_gradient(low = "black", high = "blue") +
  labs(title = "Plot of seasonal against hour") 

train_weekly <- data.frame(train_msts_decomp)

train_weekly %>%
  mutate(date = train_new$transaction_date) %>% 
  mutate(Day  = wday(date, label = TRUE, abbr = FALSE), Hour = (hour(date))) %>% 
  group_by(Day, Hour) %>%
  summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Day), stat ="identity",  width = 0.7)+
  scale_x_continuous(breaks = seq(10,22,1))
## `summarise()` regrouping output by 'Day' (override with `.groups` argument)

  labs(title = "Multi-Seasonality Analysis  - Weekly & Hourly") 
## $title
## [1] "Multi-Seasonality Analysis  - Weekly & Hourly"
## 
## attr(,"class")
## [1] "labels"

n visitors for n foods ordered within a single transaction

train_msts2<-train_new2$visitor %>% msts(seasonal.periods = c(13,13*7))
train_msts_decomp2 <- train_msts2 %>% mstl()
train_msts_decomp2 %>% autoplot()

train_new2 %>% 
  mutate(Hour = hour(transaction_date), Seasonal = train_decomp2$seasonal) %>% 
  distinct(Hour, Seasonal) %>% 
  ggplot(aes(x = Hour, y = Seasonal)) +
  geom_col(aes(fill = Seasonal))+
  scale_fill_gradient(low = "black", high = "blue") +
  labs(title = "Plot of seasonal against hour") 

train_weekly2 <- data.frame(train_msts_decomp2)

train_weekly2 %>%
  mutate(date = train_new$transaction_date) %>% 
  mutate(Day  = wday(date, label = TRUE, abbr = FALSE), Hour = (hour(date))) %>% 
  group_by(Day, Hour) %>%
  summarise(Seasonal = sum(Seasonal13 + Seasonal91)) %>%
  ggplot() +
  geom_bar(aes(x = Hour, y = Seasonal, fill = Day), stat ="identity",  width = 0.7)+
  scale_x_continuous(breaks = seq(10,22,1))
## `summarise()` regrouping output by 'Day' (override with `.groups` argument)

  labs(title = "Multi-Seasonality Analysis  - Weekly & Hourly") 
## $title
## [1] "Multi-Seasonality Analysis  - Weekly & Hourly"
## 
## attr(,"class")
## [1] "labels"

From the above visualizations, we can be confident that our previous assumptions remain true, there really are multiple seasonalities for both hourly interval and daily interval.

Modeling Fitting and Cross-Validation

Cross-validation

In this section, we separate the train_msts data frame into testing_new variable for the last week of the available data to validate our model with training_new being the rest of the data (the same goes for train_msts2), and determine which model performs the best.

testing_msts <- train_msts %>% tail(13*7)
training_msts <- train_msts %>% head(length(train_new) - 13*7)
testing_msts2 <- train_msts2 %>% tail(13*7)
training_msts2 <- train_msts2 %>% head(length(train_new2) - 13*7)

Modeling

Here, we will make 3 different models to be evaluated, namely HoltWinters model, ETS model, and Arima model.

n visitors for n receipts

model_holt_msts <- HoltWinters(training_msts)
model_stlm_ets <- training_msts %>% stlm(method = "ets")
model_stlm_arima <- training_msts %>% stlm(method = "arima")

n visitors for n foods ordered within a single transaction

model_holt_msts2 <- HoltWinters(training_msts2)
model_stlm_ets2 <- training_msts2 %>% stlm(method = "ets")
model_stlm_arima2 <- training_msts2 %>% stlm(method = "arima")

Forecasting

n visitors for n receipts

holt_forecast <- forecast(model_holt_msts, 13*7)
autoplot(train_msts, series = "Actual") +
  autolayer(holt_forecast$mean, series = "ets prediction")

forecast_ets <- forecast(model_stlm_ets, h = 13*7)
autoplot(train_msts, series = "Actual") +
  autolayer(forecast_ets$mean, series = "ets prediction")

forecast_arima <- forecast(model_stlm_arima, h = 13*7)
autoplot(train_msts, series = "Actual") +
    autolayer(forecast_arima$mean, series = "Arima prediction")

n visitors for n foods ordered within a single transaction

holt_forecast2 <- forecast(model_holt_msts2, 13*7)
autoplot(train_msts2, series = "Actual") +
  autolayer(holt_forecast2$mean, series = "ets prediction")

forecast_ets2 <- forecast(model_stlm_ets2, h = 13*7)
autoplot(train_msts2, series = "Actual") +
  autolayer(forecast_ets2$mean, series = "ets prediction")

forecast_arima2 <- forecast(model_stlm_arima2, h = 13*7)
autoplot(train_msts2, series = "Actual") +
    autolayer(forecast_arima2$mean, series = "Arima prediction")

Evaluation

Accuracy

n visitors for n receipts

accuracy_holt <- forecast::accuracy(holt_forecast$mean, testing_msts)
accuracy_ets <- forecast::accuracy(forecast_ets$mean, testing_msts)
accuracy_arima <- forecast::accuracy(forecast_arima$mean, testing_msts)
summary <- rbind(accuracy_holt, accuracy_ets, accuracy_arima)
rownames(summary) <- c("HoltWinters Accuracy", "ETS Accuracy", "Arima accuracy")
summary
##                             ME     RMSE      MAE  MPE MAPE      ACF1 Theil's U
## HoltWinters Accuracy -3.068692 8.252864 6.547467 -Inf  Inf 0.4632981         0
## ETS Accuracy         -4.964592 8.707482 7.044089 -Inf  Inf 0.3511162         0
## Arima accuracy       -2.067616 7.404495 5.714110  NaN  Inf 0.3442368         0

n visitors for n foods ordered within a single transaction

accuracy_holt2 <- forecast::accuracy(holt_forecast2$mean, testing_msts2)
accuracy_ets2 <- forecast::accuracy(forecast_ets2$mean, testing_msts2)
accuracy_arima2 <- forecast::accuracy(forecast_arima2$mean, testing_msts2)
summary2 <- rbind(accuracy_holt2, accuracy_ets2, accuracy_arima2)
rownames(summary2) <- c("HoltWinters 2 Accuracy", "ETS 2 Accuracy", "Arima 2 accuracy")
summary2
##                                ME     RMSE      MAE  MPE MAPE      ACF1
## HoltWinters 2 Accuracy  -6.307412 17.73203 14.17475 -Inf  Inf 0.2951527
## ETS 2 Accuracy         -13.824325 21.27136 17.28505 -Inf  Inf 0.2288132
## Arima 2 accuracy        -4.327900 16.67206 13.15267  NaN  Inf 0.2183244
##                        Theil's U
## HoltWinters 2 Accuracy         0
## ETS 2 Accuracy                 0
## Arima 2 accuracy               0

Based on the accuracy results above, it can be deduced that the second data frame in which we estimate the number of visitors by considering the amount of foods ordered in a single receipt is not a suitable data frame as reflected on their extremely high MAE and RMSE. Hence, we will not consider them further for our analysis.

Visualization of actual against model prediction

accuracy_data <- data.frame(date = train_new$transaction_date %>% tail(13*7),
  actual = as.vector(testing_msts) ,
  holt = as.vector(holt_forecast$mean) ,
  ets = as.vector(forecast_ets$mean),
  arima = as.vector(forecast_arima$mean))
accuracy_data %>% 
 ggplot() +
  geom_line(aes(x = date, y = actual, colour = "Actual"),size=0.5)+
  geom_line(aes(x = date, y = holt, colour = "Holt Winter Model"),size=0.3)+
  geom_line(aes(x = date, y = arima, colour = "Arima Model (Best Model)"),size=0.5)+
  geom_line(aes(x = date, y = ets, colour = "ETS Model"), size = 0.3) +
  labs(title = "Hourly Visitors - Actual Vs All Models",x = "Date",y = "Visitor",colour = "")

Prediction on data-test.csv file

In this section, we are going to train an arima model using our data-train.csv file to predict the number of visitors for the next week, which then will be saved to a file, and evaluated using Algoritma’s system.

model_arima_test <- stlm(train_msts, method = "arima")
arima_forecast_test <- forecast(model_arima_test, 13*7)
test$visitor <- round(arima_forecast_test$mean)
test %>% head()
write.csv(test, file = "test_result.csv")
knitr::include_graphics("result.png")

The above image shows that we have successfully accomplished on of the main tasks provided by Algoritma, reaching MAE below 6.

Conclusion

No auto-correlation assumption

acf(model_arima_test$residuals)

As the lag 1 does not surpass the top and bottom dotted-blue line limit, then auto-correlation assumption is fulfilled.

Box.test(model_arima_test$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  model_arima_test$residuals
## X-squared = 0.0081534, df = 1, p-value = 0.9281

As the p-value is above 0.05, it does not have auto-correlation.

Normality of residual assumption

shapiro.test(x = model_arima_test$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_arima_test$residuals
## W = 0.99114, p-value = 0.000006817

However, from the Shapiro_Wilk test,it can be seen that the p-value is lower than 0.05, therefore the residuals are not distributed normally. This might indicates that we cannot ensure that the error will always be consistent for future analysis. This phenomenon might happen from the lack of amount of data that we have.

Keypoints

  • The restaurant usually has the most visitors on Saturday, implying that they might want to allocate more resources during this day.
  • The peak hourly time for the amount of visitors is around 8pm every day, meaning that they can expect more visitors coming around that time period.
  • The restaurant is usually the least crowded at 10am.